Single-cell analysis of treatment-resistant prostate cancer: Implications of cell state changes for cell surface antigen–targeted therapies

Significance Treatment of prostate cancer is rapidly evolving with several promising drugs targeting different cell surface antigens. Selection of patients most likely to benefit from these therapies requires an understanding of how expression of these cell surface antigens varies across patients and how they change during disease progression, particularly in tumors that undergo lineage plasticity. Using immunohistochemistry and single-cell mRNA sequencing, we reveal heterogeneity of cell states across a cohort of advanced disease prostate cancer patients; this heterogeneity is not captured by conventional histology-based designations of adenocarcinoma and neuroendocrine prostate cancer. We show that these cell states can be identified by gene regulatory networks that could provide additional diagnostic precision based on their correlation with clinically relevant cell surface antigen expression.

slides were digitized on a Ventana DP 200 Slide Scanner (Roche).All cases were reviewed by experienced genitourinary pathologists (M.R., M.C.H.) and immunoreactivities were scored in a blinded manner using a previously established H-score system, whereby the optical density level ("0" for no brown color, "1" for faint and fine brown chromogen deposition, and "2" for prominent chromogen deposition) was multiplied by the percentage of cells at each staining level, resulting in a total H-score (range 0-200).The final H-score for each sample was the average of two replicate tissue cores.For Ki-67, any nuclear positivity was counted, and the percent of positive nuclei is shown (range 0-100).Included patient identifiers and H-scores for each protein are detailed in Dataset S1.
Comparison H-Score Trends for Lineage Markers Across PRAD/HGC/NEPC.For each histology (PRAD, HGC, and NEPC), H-scores per marker gene were compared using a modified Wilcoxon-ranked test (please refer to https://rdrr.io/github/chvlyl/ZIR/man/ziw.html).P-values were adjusted using Bonferroni correction for the number of tests within each gene (Figure 1B-D and H-I, Supplementary Figure S1A).For YAP1 and EZH2, H-scores were also compared between NEPC versus PRAD plus HGC or PRAD versus HGC plus NEPC respectively, using a modified Wilcoxon-ranked test.Furthermore, scatter plots were generated for YAP1, SYP, DLL3, INSM1, or TROP2 versus ASCL1 or NEPC score (defined as mean H-score of ASCL1, INSM1, and FOXA2), or EZH2 versus Ki67.To evaluate for concordance between the H-scores of two genes of interest, Lin's concordance correlation coefficient (CCC) was calculated (Figure 1F-1G and Supplementary Figure S1D and S1F).For genes without clear linearity, CCC was not calculated, but rather the data was simply visualized and displayed as a scatter plot (Supplementary Figure S1C and S1E).For EZH2 and Ki-67, as the H-score and proliferative index were on different scales (either 0-200 or 0-100, respectively), Pearson's correlation was used to define the strength of correlation (Figure 1E).

Sample Handing and Characteristics for Single Cell RNA-Sequencing
Patient derived tissue.Tumor tissues were collected from patients undergoing a surgical resection or tissue biopsy for clinical care at Memorial Sloan Kettering Cancer Center (MSKCC).
Samples were freshly derived from tumor sites from patients with untreated (naïve) or CSPC, or CRPC.All CRPC patients in this cohort had progressed after treatment with antiandrogen therapy and a next-generation anti-androgen therapy (ARSI), with or without a taxane.Serum PSA levels were reported from the date one week before or after of biopsy.
Detailed clinical characteristics including AJCC staging, location, PSA, type and number of prior treatments, and genomic alterations are listed in Dataset S3.Routine anatomic pathology and immunohistochemistry were performed by a MSKCC pathologist (A.G.).For CRPC single cell biopsies, distinct from the categorization of TMA samples, patients were classified into either CRPC adenocarcinoma (CRPC-adeno) or neuroendocrine with characteristic small cell features (NEPC).
Targeted sequencing of human tumor tissue.For patients who consented to IRB 12-245 (NCT 01775072; MSK-IMPACT) (7) and for clinical purposes, targeted sequencing was performed for a panel of actionable cancer genes for alterations (available for 14 patients, Dataset S3, also subset previously reported in ( 6)).Genomic data was collected on retrospective review of the electronic medical record.
Sample processing of tissue.Fresh benign and malignant tissue was mechanically cut using a scalpel into small pieces (~1-5 mm 3 ).The tissue was then processed and dissociated in 5-10 mg/ml collagenase type II (Gibco) solution in adDMEM/F12+/+/+ with 10 µM Y-27632 dihydrochloride for 30 minutes to 2 hours on a 37 o C shaking platform.This was followed by a 1 minute 0.5 M EDTA wash at room temperature, and subsequent digestion with TrypLE (Gibco) with 10 µM Y-27632 dihydrochloride for 5-10 minutes at 37 o C on a shaking platform until a single cell suspension was obtained.If more than 10% of doublets were present (visual inspection) or there was evidence for <80% viability (hemocytometer, using 0.2% Trypan Blue), then cells were FACS-sorted for singlets and viability using a DAPI.
Sample processing: 10X single cell RNA-sequencing.Dissociated cells were subjected to scRNA-seq using 10X genomics Chromium Single Cell 3' Library and Gel bead Kit (v3 for human, except v2 for HP95T as previously described ( 5)) per manufacturer's protocol.~3000 to 10,000 cells per sample was encapsulated and barcoded following the manual.Sample viability varied between 72 and 95% (0.2% Trypan Blue).The final sequencing libraries were double-size purified (0.6-0.8X) with SPRI beads and sequenced on Illumina Nova-Seq platform (R1-26 cycles, i7-8 cycles, R2-70 cycles or higher).For human samples, on average, 4,864 cells per clinical biopsy were sequencing at a depth of ~ 82,429 reads per cell (~258 million reads per sample).The unique mapping was high, between 62.9-82.6%(except MSK-HP07A/B 58.6% and 60.5%), with a median number of unique transcripts per cell being 7,492.

Single Cell RNA-Seq Processing and Analyses
Preprocessing of scRNA-sequencing dataset.For each human sample, FASTQ files produced from 10X scRNA-seq were mapped to the pre-built human reference (hg38) and converted to counts using the pipeline "cellranger count" (8).Cells were filtered out (or removed) based on two criteria: (1) high fraction of mitochondrial molecules > 25% and (2) low library complexity expressing very few unique genes (<200 genes).Furthermore, putative doublets were removed using the 'DoubletFinder' package (9).Combining samples in the entire cohort of Naïve or CSPC, CRPC-adenocarcinoma and NEPC yielded a count matrix of 119,083 cells x 33,473 transcripts with a median of 8,268 molecules per cell and a median of 4,096 cells per biopsy.The count matrix was then normalized by (1) feature counts divided by total feature counts in each cell, (2) scaled by 10,000 ('scale.factor'),and (3) subjected to natural-log transformation with a pseudo-count of 1 ('NormalizeData').Thereafter, cell-cycle signatures were calculated ('CellCycleScoring') and the signature scores were regressed.The features were then scaled and centered ('ScaleData').Principal component analysis (PCA) was based on 2000 highly variable genes (HVG) (identified using 'FindVariableFeatures', excluding mitochondrial, ribosomal, and with MALAT1, UBA52, NEAT1, TMSB4X, and TMSB10--hererin referred to house-keeping genes) and PCA analysis was performed on the scaled matrix of HVGs with the top 30 principal components (PCs) detected by knee-point (33% variance explained).

Batch Correction
Batch correction for subsetted coarse cell type.Batch correction was not necessary for the identification of coarse cell types (i.e.epithelial/neuroendocrine, lymphoid, myeloid, and stromal).
However, upon subsetting each coarse cell type, patient-level batch effect was observed with cells clustering by patient rather than by cell type (e.g.multiple Treg populations with FOXP3 expression).Therefore, batch correction was performed on each coarse cell type using fastMNN (further detailed below in 'Human Cell Type Annotation') due to the ability to perform hierarchical merging similar histologies (i.e.CRPC-adenocarcinoma or NEPC).To evaluate the effect of batch correction, we utilized an entropy-based measure that quantifies how much normalized expression mixes across patients (10,11).This demonstrated that immune cells had the highest entropy (followed by stromal), whereas epithelial/tumor cells had the lowest entropy suggestive of increased inter-tumoral diversity (Supplementary Figure S3G).Furthermore, by examining the expression of specific cell type markers, we did not observe an over-correction of granular cell types (e.g.normal epithelium versus tumor, neuroendocrine, Tregs, macrophage and immune subsets, etc.).These results thus balanced the need for batch correction, while maintaining true biologic heterogeneity.

Human Cell Type Annotation
Human coarse cell type identification.We then used a hierarchical strategy for cell typing, initially at coarse resolution (epithelial, immune, versus stroma), and granular cell types (normal versus tumor, neuroendocrine, etc.) using canonical marker gene expression within unsupervised clusters.For coarse resolution cell typing, clustering was performed by first constructing a kNN graph on the Euclidean distance in PCA space with 30 principal components (PCs) as identified by knee-point.Edge weights between two cells were adjusted based on Jaccard similarity (k=30, based on adjusted Rand index), generating a shared nearest-neighbor graph (SNN).Louvain clustering was applied to the SNN with resolution=0.Human designation of epithelial and tumor cells.For EPCAM-positive or SYP/CHGA-positive clusters (N=68,816 cells, including 28,508 CSPC cells, 24,866 CRPC-adenocarcinoma cells, and 15,442 NEPC cells), we projected scaled counts onto the top 30 PCs based on knee-point, restricting to 2,000 highly variable genes and a set of biologically relevant genes (AR, JAK/STAT, EMT, NEPC-related, (6), found in Dataset S13), corresponding to 28.8% of variance explained.Louvain clustering (as described above) was performed with resolution of 0.5 generating 31 clusters.To identify malignant tumor cells within subsetted epithelial/neuroendocrine cells, we evaluated the expression of genes known to be upregulated in malignant prostate cancer, including AR, KLK2, KLK3, FOLH1, STEAP1, STEAP2, TACSTD2, ERG, ETV1, ETV6, ASCL1, NEUROD1, DLL3, CHGA, and CHGB (Supplementary Figure S3B) (12).Specifically, 20 out of 31 clusters showed robust expression of at least 2 markers.If clusters were largely derived from NEPC samples (MSK-HP19, MSK-HP20, MSK-HP21) and showed expression of SYP, CHGA, and ASCL1 or NEUROD1, the clusters were labeled as NEPC.The majority of non-tumor cells (including basal, club cells, and lowly expressing luminal cells) (5) were derived from primary tumors, which are well known to be admixed with benign cells.
Putative tumor cells were again subsetted (N=35,962 cells), and subjected to batch correction using FastMNN, (13) which was applied to the normalized count matrix and reduced to the top 30 PCs (refer to section 'Batch correction' for rationale).Here, we performed Phenograph clustering (k=40), which resulted in 32 clusters.All but one cluster showed consistent expression of tumor associated markers, except for cluster 6 (N=266 cells) which exhibited the distinct expression of ALB, CRP, VTN, A1CF, GC, and AMBP, known to be expressed in hepatocytes, and comprised of cells from patients MSK-HP19, MSK-HP09, MSK-HP17, and MSK-HP18, where the biopsy was derived from liver metastases (Supplementary Figure S3C).We thus concluded that this cluster comprises of benign hepatocytes, and thus was removed from the final set of tumor cells.
These malignant cells, without benign hepatocytes, were subsetted (N=35,696 cells; 4,001 CSPC cells, 20,088 CRPC-adenocarcinoma cells, and 11,607 NEPC cells) and underwent batch correction (refer to section 'Batch correction' for rationale).FastMNN was applied to the normalized count matrix, reduced to the top 20 PCs.Phenograph clustering (k=30) was performed to identify 31 clusters, all of which showed expression of at least 2 tumor associated markers (Supplementary Figure S3B).As a confirmatory approach, we assessed single-cell copy number variation (CNV) profiles in tumor cells, by utilizing inferCNV package, which measures the expression of genes across chromosome positions and compares them to a predesignated reference set of cells (myeloid immune cells).To identify CNV profiles, we used a sliding window approach of 101 genes and considered any deviation from the reference mean of at least 1.5 standard deviation as a copy number variation (using inferCNV package).As expected, we observed an increased number CNVs in malignant cells with highest burden in NEPC ( 14) consistent with prior reports and providing additional support for accurate identification of malignant tumor cells (Supplementary Figure S3D).We classified each Phenograph cluster into Basal, LumA, and LumB using the pre-defined PAM50 model from genefu (https://bioconductor.org/packages/release/bioc/html/genefu.html).Considering that our tumor cells lacked HER2-positivity and that we already filtered out normal cells, we excluded these categorizations from our PAM50 classification.( 15

Measuring inter-patient heterogeneity per cell type
For each tumor type, including Naïve or CSPC, CRPC-adenocarcinoma, and NEPC, we employed an entropy-based approach to assess interpatient diversity, as has been previously described (Figure 2C and Supplementary Figure S3F and S3H) (10,11).We used Phenograph clusters (k=30) derived from a subset of tumor cells (N=35,696 cells), myeloid, lymphoid, and stromal populations.Each cluster C represents a discrete phenotype of a given cell.To account for varying cell numbers across clusters and cell types, we implemented a subsampling strategy.Specifically, we randomly selected 100 cells from each cluster, repeating this 100 times with replacement.We calculated the Shannon entropy of patient frequencies P in each subsample, denoted as Hc : To avoid bias introduced by repetitive sampling of a limited cell pool, clusters containing fewer than 100 cells were excluded from the calculation.The distribution of Shannon entropies, bootstrapped from clusters, was analyzed, and compared using a Bonferroni-adjusted Wilcoxon signed-rank test.

Identification of Single-Cell Regulatory Network Inference
Human regulatory network inference.To understand the TF networks that may account for the observed diversity in CRPC-adenocarcinoma and NEPC tumors, we utilized single-cell regulatory network inference (SCENIC) (16) and obtained a regulon-activity matrix for each cell.
The regulon-activity was than z-scaled as is shown in Figure 2D and Supplementary Figure S4A and S4C.To obtain a set of distinct GRNs that contribute to the observed heterogeneity in CRPC and NEPC samples, we hierarchically clustered Phenograph clusters on the scaled regulon activity matrix, employing the Ward D2 method (Euclidean distance) for both CRPCadenocarcinoma and NEPC clusters.To define common and distinct gene-regulatory networks (GRNs, group of regulons) across samples in an unbiased fashion, we applied a dendrogram height cutoff of 15 based on the adjusted Rand index (ARI) of CRPC-adenocarcinoma (overall >= 0.65 indicating moderate to high recovery) (Supplementary Figure S4B), and for consistency, we applied the same threshold for NEPC.This yielded 10 and 5 GRNs for CRPCadenocarcinoma and NEPC, respectively.Among NEPC GRNs, the 2 smallest GRNs with cells < 600 (NEPC-2 and NEPC-3) lacked distinct regulon activity from their nearest neighbor, and therefore were merged (NEPC-2 with NEPC-HOXD11/SOX6 GRN and NEPC-3 with NEPC-A GRN).This resulted in 10 and 3 final GRNs for CRPC-adenocarcinoma and NEPC, respectively.
To determine the order of 13 defined GRNs, we additionally performed hierarchical clustering on each set of the 10 CRPC-adenocarcinoma or 3-NEPC GRNs, generating the dendrogram shown in the heatmap in Figure 2D.For each GRN, we calculated regulon specificity scores (RSS) (17) on the raw regulon activity matrix with the function calcRSS in SCENIC based on Jensen-Shannon divergence.We then z-scaled the RSS, assigning each regulon to the GRN group where it scored the highest.To rank regulons in each GRN by their activity, we further performed Student's t-test comparing the mean value of a regulon in a given GRN versus all other GRNs.Significantly active regulons within each GRN were defined using the criteria of a p-value < 0.01, a mean regulon activity value ≥ 1.0, and a mean difference of regulon activity in a given GRN versus all other GRNS ≥ 1.0.The mean value and regulon ranking within each GRN can be found in Dataset S5 and S6, respectively.

Robustness Analysis of Human GRNs.
A. Subsampling cells: We iteratively subsampled cells to ensure the robustness of our GRN analysis.We randomly sampled 10, 20, 50 and 100 cells from each Phenograph cluster and then performed an identical set of downstream analyses as with our initial/complete dataset.This includes the following: (1) inferring the regulon activity with SCENIC ( 16), ( 2

B. Subsampling patients:
To further ensure the robustness of our GRN analysis, we subsampled 100 cells in 6, 8, and 10 CRPC or NEPC patients (of 13 total CRPC and NEPC patients) and repeated this process iteratively 10 times with replacement.For each subsampling run, we utilized SCENIC to identify enriched TF/regulons and calculated the proportion of overlapping TFs/regulons between our initial GRNs (Figure 2D) and newly identified subsampled GRNs.For initial and subsampled GRNs that share more than 50% of TFs/regulons, we utilized a Fischer' exact test to determine whether the overlap was significant: P-value < 0.05--constituting a shared and recurrent GRN.By subsampling 6, 8, and 10 patients, we identified eight, nine and nine recurrently identified GRNs that significantly overlapped in our initial GRN assignments (and were found in more than 5 of the iterations) (Supplementary Figure S6).Several of the recurrently noted GRNs had intact AR expression (AR+ CREB3+, IRF2+ Inflammatory, AR+HNF4G+, AR+HOXB13+FOXA1+).Of note, AR+ clusters also demonstrated the most mixing between tumors (as defined by patient entropy, Supplementary Figure S2F).We also noted recurrent GRNs for non-AR-driven pathways, including TCF7L2+, FOSL1+ AP-1, SOX2/4+ Embryonic EMT, and the NEPC GRNS NEPC-A, NEPC-N, and NEPC-H/S.GEMM regulatory network inference.For the GEMM analysis, we obtained previously processed and published scRNA-seq data (6) for which we similarly obtained a regulon-activity matrix for each cell, and z-scaled the matrix.As above, hierarchical clustering was performed on Phenograph clusters (k=30) using the Ward2 method (Euclidean distance) on a scaled regulon activity matrix, and the GRN groups were defined based on a cutoff of h=12 (ARI overall >= 0.64, the second stabilized cutoff) (Supplementary Figure S7A-B).Any GRN less than 400 cells, which by visual inspection lacked distinct groups of unique regulons, was merged to its nearest neighbor.This resulted in 9 final GRNs (Figure 3A).To further determine the order of the defined GRNs, we additionally hierarchically clustered the 9 GRNs, generating the dendrogram shown in the heatmap in Figure 3A.For each GRN, we then calculated regulon specificity scores (RSS) on the raw regulon activity matrix with the function calcRSS based on Jensen-Shannon divergence in SCENIC (17).We then z-scored the RSS, assigning each regulon to the GRN group where it scored the highest.To rank regulons in each GRN by their activity, we further performed Student's t-test comparing the mean value of a regulon in a given GRN versus all other GRNs.Significantly active regulons within each GRN were defined using the criteria of a pvalue < 0.01, a mean regulon activity value ≥ 1.0, and a mean difference of regulon activity in a given GRN versus all other GRNS ≥ 1.0.The mean value and regulon ranking within each GRN can be found in Dataset S8 and S9, respectively.Robustness Analysis of GEMM GRNs.Similar to the human SCENIC robustness analysis, 100 cells were randomly sampled from each Phenograph clusters and were subjected to an identical set of downstream analyses as with our initial/complete dataset.ARI was calculated between down-sampled GRN groups and our complete dataset GRN groups.We noted moderate recovery with an ARI = 0.74.Furthermore, the Venn diagram was constructed to show the number of overlapping TFs between the down-sampled and complete dataset (Fisher's exact test P-value < 0.05) (Supplementary Figure S7C).

Differential Expression and Pathway Analysis:
Identifying DEGs and MAST.For all differential expression, we used the 'FindAllMarkers' function, implementing MAST v. 1.22.0.This package provides a flexible framework for fitting a hierarchical generalized linear model to the expression data.A regression model was executed as below:

Yi,j ~ condition
The condition represents the condition of interest and Yi is the expression level of gene i in cells of cluster j, transformed by natural logarithm with a pseudo-count of 1. Significantly differentially expressed genes were considered with an adjusted p-value < 0.05 and absolute log2fold-change (log2FC) > 0.4 (18).

Comparison of Human and GEMM GRNS.
To compare human and GEMM GRNs, we implemented both GSEA and over-representation analysis (ORA).These analyses identified which human GRNs were enriched/overrepresented in GEMM GRNs and vice versa.We curated the top 5 TFs based on RSS scores and their target genes as noted by SCENIC, to create gene sets for both Human and GEMM GRNs (Dataset S14).For inter-species comparison, we mapped mouse genes to their human counterparts (and vice versa) using the biomaRt package (https://bioconductor.org/packages/release/bioc/html/biomaRt.html), and discarded genes without orthologs.
We then performed differential expression analysis (refer to 'Identifying DEGs and MAST') to generate the input for GSEA and ORA analyses.GSEA was performed for each GRN.

Novel cell surface marker detection in AR-positive, AR-negative and NEPC GRNs.
In conjunction with the investigation of known cell surface markers in clinical development (PSMA, STEAP1, STEAP2, CEACAM5, and DLL3), we characterized the most highly expressed cell surface markers in each AR-positive, AR-negative and NEPC regulons.Significantly upregulated genes for each group (AR-positive, AR-negative and NEPC GRNs) were determined through MAST differential analysis (average log2FC>0.4,adjusted P-value<0.05).We then restricted to published cell surface markers (Dataset S16, from https://wlab.ethz.ch/surfaceome/).
A heatmap was generated with the gene expression of the top 15 genes for each group, along with known cell surface markers in clinical development (Supplementary Figure S10D).DLL3 expression across NEPC and CRPC.DLL3 expression was analyzed in both CRPCadenocarcinoma and NEPC cells using both non-imputed and imputed (MAGIC, k=20, t=1).

Scatter plots were generated for CRPC-adenocarcinoma cells comparing DLL3 expression
versus CHGB and ASCL1 expression, as well as DLL3 expression versus NEPC module score (Supplementary Figure S10H).Furthermore, stacked box and density plots of DLL3 nonimputed and imputed expression (MAGIC, k=20, t=1) per NEPC regulon were generated (Supplementary Figure S10E and Figure 4E, respectively).Of note, NEPC module score (used for y-axis in scatter plot) was calculated using the 'AddModuleScore' function (genes defined in Dataset S13).

Intra-Patient Heterogeneity of MSK-HP13 patient
To address the intra-tumoral heterogeneity of drug target genes, we analyzed each CRPC and NEPC tumor biopsy for variable expression within Phenograph clusters.Specifically, we observed heterogeneity of both PSMA and DLL3 in MSK-HP13.MSK-HP13 tumor cells were    regulon specificity scores (RSS) and their associated target genes for each Human GRN (Methods).GSEA was performed on pre-ranked genes using 10,000 permutations.Gene ranks were calculated based on MAST differential expression (Methods).Dot size denotes thelog(p.adjustedvalue) and gradient from gray to red is normalized enrichment score (scale: -2.5 to 5).ORA was performed on DEGs with average log2FC > 0.25, adjusted p-value < 0.05.Dot size denotes the -log(q-value) and gradient from gray to red is observed over expected ratio GSEA was performed on pre-ranked genes using 10,000 permutations.Gene ranks were calculated based on MAST differential expression (Methods).Dot size denotes thelog(p.adjustedvalue) and gradient from gray to red is normalized enrichment score (scale: -2 to 4).ORA was performed on DEGs with average log2FC > 0.25, adjusted p-value < 0.05.Dot size denotes the -log(q-value) and gradient from gray to red is observed over expected ratio (scale: 0 to 3).Shared GRN pairs having significant enrichment in all iterative tests are outlined in black with the shared top TF labeled (Methods).GSEA was performed on pre-ranked genes using 10,000 permutations.
) z-scaling of the regulon activity, (3) hierarchical clustering of Phenograph clusters (k=20) for both CRPC-adeno and NEPC, (3) selection of the h-cutoff based on ARI and (4) subsequent GRN grouping (h=12 for 10 cells random sampling yielding 15 CRPC GRN and 7 NEPC GRN, h=13 for 20 cells random sampling yielding 15 CRPC GRN and 8 NEPC GRN, h=13 for 50 cells random sampling yielding 14 CRPC GRN and 8 NEPC GRN, and h=12 for 100 cells random sampling yielding 14 CRPC GRN and 8 NEPC GRN), (5) order GRNs by additional hierarchical clustering, and (6) calculation of RSS.For additional details on the complete run, please refer to 'Human regulatory network inference.'To examine the reproducibility ofGRN groups with regulon activity, we calculated the ARI between down-sampled GRN groups and our complete dataset with defined human GRN using all cells.All datasets showed moderate recovery (ARI for 10 cells random sampling = 0.69, ARI for 20 cells random sampling = 0.75, ARI for 50 cells random sampling = 0.84, ARI for 100 cells random sampling = 0.75).A Venn diagram also showed the number of overlapping TFs between then downsampled results and all cells with the significance of the overlap assessed using Fisher's exact test.All down-sampled datasets exhibited a significant overlap of TFs (Fisher's exact test pvalue < 0.05) (Supplementary FigureS5A and S5B).

9 ZBTB7B
Supplementary Figure S8.GEMM and Human CRPC/NEPC Overlap.(A) Dot plot for GSEA (left) and over-representation (ORA) (right) analyses for the enrichment of human GRN gene sets in GEMM GRNs.Gene sets were constructed with the top 5 transcription factors (TFs) based on

(
scale: 0 to 3).Shared GRN pairs having significant enrichment in all iterative tests are outlined in black with the shared top TF labeled (Methods).(B) Dot plot for GSEA (left) and overrepresentation (ORA) (right) analyses for the enrichment of GEMM GRN gene sets in Human GRNs.Gene sets were constructed with the top 5 transcription factors (TFs) based on regulon specificity scores (RSS) and their associated target genes for each GEMM GRN (Methods).
. Ascl1, Neurod1, and Pou2f3 Overlap in SCLC and NEPC (A) Dot plot for GSEA analysis for the enrichment of human GRN gene sets in SCLC subtypes (SCLC-A and SCLC-N).Gene sets were constructed with the top 5 transcription factors (TFs) based on regulon specificity scores (RSS) and their associated target genes for each Human GRN (Methods).GSEA was performed on pre-ranked genes using 10,000 permutations.Gene ranks were calculated based on MAST differential expression (Methods).Dot size denotes thelog(p.adjustedvalue) and gradient from gray to red is normalized enrichment score (scale: -2 to 2).(B) Dot plot for GSEA analysis for the enrichment of GEMM GRN gene sets in SCLC-P subtype.Gene sets were constructed with the top 5 transcription factors (TFs) based on regulon specificity scores (RSS) and their associated target genes for each GEMM GRN (Methods).
Gene ranks were calculated based on MAST differential expression (Methods).Dot size denotes thelog(p.adjustedvalue) and gradient from gray to red is normalized enrichment score (scale: -2 to 2).(C) SCLC-P vs. rest (shown on y-axis) or GEMM Pou2f3 vs. rest (shown on x-axis) were compared using MAST and the log2FC for each gene is shown on the scatter plot.Genes with log2FC > 0.4 (and padj<0.05)are labeled with TFs noted in brown for being enriched in both SCLC-P and GEMM Pou2f3, respectively.(D) Venn diagram shows the overlap of top DEGs (average log2FC > 0.4, adjusted p-value < 0.05) shared between SCLC-P and GEMM Pou2f3.A Fisher's exact test was used for significance of overlap.